[VL] Add lazy per-column deserialization for Columnar Table Cache #12211

Open
jackylee-ch wants to merge 1 commit into apache:main from jackylee-ch:table-cache-lazy-deserialization

@jackylee-ch
Contributor

@jackylee-ch jackylee-ch commented Jun 1, 2026

What changes

This PR makes the Velox table cache write V3 per-column framed bytes by default. Lazy materialization becomes a base table-cache capability; spark.gluten.sql.columnar.tableCache.partitionStats.enabled now controls only the optional stats/pruning payload.

  • Removes spark.gluten.sql.columnar.tableCache.lazy.deserialization.enabled.
  • Adds V3 no-stats serialization (statsLen=0) for the default lazy path.
  • Keeps V3 with stats for partition pruning when partition stats are enabled.
  • Keeps V2 stats and legacy raw bytes as native-capability / backward-read fallback paths.
  • Routes V3 cached bytes through projected native deserialization.
  • Adds JVM/native golden, lazy serde, and GHA benchmark coverage.
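
The description does not spell out the V3 byte layout, so the following is only a toy sketch of why per-column framing enables projected reads. Everything in it is an assumption made for illustration: the magic `b"TCV3"`, the little-endian `u32` length fields, the field order, and the helper names `encode_frame` and `decode_projected` are hypothetical and are not Gluten's actual wire format or API. An empty stats blob stands in for the `statsLen=0` no-stats case.

```python
import struct

MAGIC = b"TCV3"  # hypothetical magic, not Gluten's real value


def encode_frame(columns, stats=b""):
    """Frame each column payload separately; stats=b'' models statsLen=0."""
    out = [MAGIC, struct.pack("<I", len(stats)), stats,
           struct.pack("<I", len(columns))]
    for payload in columns:
        out.append(struct.pack("<I", len(payload)))
        out.append(payload)
    return b"".join(out)


def decode_projected(frame, wanted):
    """Return {column index: payload} for wanted columns, skipping the rest."""
    if frame[:4] != MAGIC:
        raise ValueError("not a V3-style frame")
    (stats_len,) = struct.unpack_from("<I", frame, 4)
    pos = 8 + stats_len  # skip the optional stats blob
    (num_cols,) = struct.unpack_from("<I", frame, pos)
    pos += 4
    result = {}
    for idx in range(num_cols):
        (length,) = struct.unpack_from("<I", frame, pos)
        pos += 4
        if idx in wanted:
            result[idx] = frame[pos:pos + length]  # only these bytes are read
        pos += length  # unrequested columns are skipped by length alone
    return result


frame = encode_frame([b"c0-bytes", b"c1-bytes", b"c2-bytes"])
projected = decode_projected(frame, {0, 2})
```

Because every column carries its own length prefix, a reader that needs 1 of 16 columns can step over the other 15 without decoding them, which is the property the lazy path relies on.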

Performance

Four-mode comparison of eager V2 vs. lazy V3, each without and with the optional
partition-stats payload (ColumnarTableCacheLazyDeserBenchmark):

  • V2 without stats = legacy raw Presto payload (eager full-batch decode, no pruning).
  • V2 with stats = framedSerializeWithStats (eager full-batch decode + partition-stats pruning).
  • V3 without stats = per-column lazy payload (default; lazy projected decode).
  • V3 with stats = per-column lazy payload + partition-stats pruning.

100M rows / 32 partitions / 16 columns / 3 iterations, Apple M5 Pro, JDK 8 runtime, real Gluten
(off-heap enabled, ColumnarCachedBatchSerializer). Read phases build one mode's cache at a time so
the full 100M fits. Times are avg ms, lower is better; relative is vs V2 without stats.

Cache footprint (storage memory)

| Mode | Footprint |
| --- | --- |
| V2 without stats | 14542 MiB |
| V2 with stats | 14558 MiB |
| V3 without stats | 14543 MiB |
| V3 with stats | 14565 MiB |

Footprint is effectively identical across all four modes (the largest gap is 23 MiB, about 0.16%):
V3 per-column framing does not regress cache size for flat data, and the stats payload is negligible.
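
As a quick check on the footprint claim, the relative overhead of each mode against the V2 no-stats baseline can be recomputed from the figures in the table. This is plain arithmetic on the reported numbers; the variable names are illustrative only.

```python
# Footprints in MiB, copied from the table above.
footprint_mib = {
    "V2 without stats": 14542,
    "V2 with stats": 14558,
    "V3 without stats": 14543,
    "V3 with stats": 14565,
}

baseline = footprint_mib["V2 without stats"]

# Percentage overhead of each mode relative to the baseline.
overhead_pct = {
    mode: 100.0 * (mib - baseline) / baseline
    for mode, mib in footprint_mib.items()
}

for mode, pct in overhead_pct.items():
    print(f"{mode}: {footprint_mib[mode]} MiB ({pct:+.2f}%)")
```

Every mode stays within 0.2% of the baseline, which is what "negligible" amounts to here.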

Read latency (avg ms / relative speedup vs V2 no-stats)

| Phase | V2 no-stats | V2 +stats | V3 no-stats | V3 +stats |
| --- | --- | --- | --- | --- |
| read 1/16 cols, sum(c0) | 8217 (1.0x) | 7427 (1.1x) | 1110 (7.4x) | 1050 (7.8x) |
| read 4/16 cols, group+agg | 9325 (1.0x) | 8569 (1.1x) | 2648 (3.5x) | 2692 (3.5x) |
| filter + 2/16 cols (point lookup) | 8232 (1.0x) | 69.5 (118x) | 1210 (6.8x) | 60.6 (136x) |

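
The relative figures in the table are each phase's V2 no-stats latency divided by the mode's latency. A short sketch that recomputes them from the reported averages (names are illustrative, the numbers are the ones above):

```python
# Average latency in ms per phase, in the order:
# V2 no-stats, V2 +stats, V3 no-stats, V3 +stats.
latency_ms = {
    "read 1/16 cols, sum(c0)": [8217, 7427, 1110, 1050],
    "read 4/16 cols, group+agg": [9325, 8569, 2648, 2692],
    "filter + 2/16 cols (point lookup)": [8232, 69.5, 1210, 60.6],
}


def speedups(row):
    """Speedup of every mode relative to the first entry (V2 no-stats)."""
    base = row[0]
    return [base / ms for ms in row]


relative = {phase: speedups(row) for phase, row in latency_ms.items()}

for phase, values in relative.items():
    print(phase, [f"{v:.1f}x" for v in values])
```
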
  • Projected reads: V3 lazy decodes only the requested columns, so it is 7.4x faster reading
    1 of 16 columns and 3.5x faster reading 4 of 16, versus eager V2 which decodes all 16.
  • Filtered point lookup: partition stats prune almost all batches (V2 +stats 118x), and V3
    additionally lazy-decodes only the surviving batches' projected columns, giving the best result at
    136x (V3 with stats). Lazy column-skip alone (V3 no-stats) is 6.8x.
  • All-columns read (decode everything, no skip) was measured separately at smaller scale and is
    on par with / slightly faster than V2 (V3 ~1.3x at 2M), confirming LazyVector adds no overhead
    when every column is materialized. It is omitted from the 100M table because the eager-V2 path
    decodes the full 100M x 16 off-heap and does not fit this 64 GiB laptop.
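
The point-lookup result combines two independent effects: min/max stats skip whole batches, and lazy decoding touches only the projected columns of the batches that survive. The sketch below models both with plain Python lists standing in for encoded column bytes. `CachedBatch`, `may_contain`, and `point_lookup` are hypothetical names for illustration and do not correspond to Gluten classes.

```python
from dataclasses import dataclass


@dataclass
class CachedBatch:
    stats: dict    # column name -> (min, max); hypothetical stats payload
    columns: dict  # column name -> list of values (stands in for encoded bytes)


def may_contain(batch, column, value):
    """Min/max check: False means the batch provably has no matching row."""
    bounds = batch.stats.get(column)
    if bounds is None:
        return True  # no stats for this column, so the batch cannot be pruned
    lo, hi = bounds
    return lo <= value <= hi


def point_lookup(batches, column, value, projected):
    """Return matching rows and the number of batches that had to be decoded."""
    decoded_batches = 0
    rows = []
    for batch in batches:
        if not may_contain(batch, column, value):
            continue  # pruned without touching any column payload
        decoded_batches += 1
        key_col = batch.columns[column]
        wanted = [batch.columns[name] for name in projected]  # projected only
        for i, key in enumerate(key_col):
            if key == value:
                rows.append(tuple(col[i] for col in wanted))
    return rows, decoded_batches


# 32 batches of 100 consecutive ids each, mirroring the 32-partition setup.
batches = [
    CachedBatch(
        {"id": (i * 100, i * 100 + 99)},
        {"id": list(range(i * 100, i * 100 + 100)),
         "v": [x * 2 for x in range(i * 100, i * 100 + 100)]},
    )
    for i in range(32)
]
rows, decoded = point_lookup(batches, "id", 1234, ["id", "v"])
```

In this toy setup only 1 of 32 batches is decoded, and within it only the two projected columns are read.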

Net: V3 lazy per-column is a large win on projected/filtered reads (the common table-cache access
pattern) with identical cache footprint and no full-scan regression.

A GitHub Actions run on a larger-RAM runner can reproduce the same 100M comparison via the
Velox Backend (x86) workflow_dispatch benchmark job.

How was this patch tested?

  • ./dev/format-scala-code.sh
  • PATH="/opt/homebrew/opt/llvm@15/bin:$PATH" ./dev/format-cpp-code.sh
  • git diff --check upstream/main..HEAD
  • ruby -e 'require "yaml"; YAML.load_file(".github/workflows/velox_backend_x86.yml"); puts "yaml ok"'
  • ./.github/workflows/util/check.sh upstream/main
  • env CCACHE_DIR=/private/tmp/gluten-ccache ninja -C cpp/build velox/tests/CMakeFiles/velox_operators_test.dir/VeloxColumnarBatchSerializerTest.cc.o
  • ./build/mvn install -pl backends-velox -am -Pspark-3.5 -Pscala-2.12 -Pbackends-velox -DskipTests -Dexec.skip
  • Local benchmark runability smoke only, not used as PR performance data: Java 8, ColumnarTableCacheLazyDeserBenchmark with 1000 rows, 4 partitions, 1 iteration, phases build,read1,read4,readAll,filter.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex GPT-5

@github-actions github-actions Bot added CORE works for Gluten Core VELOX DOCS labels Jun 1, 2026
@jackylee-ch jackylee-ch marked this pull request as draft June 1, 2026 04:58

github-actions Bot commented Jun 1, 2026

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 58bd451 to d5a0502 Compare June 1, 2026 08:59
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from d5a0502 to 8e374db Compare June 1, 2026 09:05
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 8e374db to 0f0ccd2 Compare June 1, 2026 09:08
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 0f0ccd2 to 8b09d6b Compare June 1, 2026 11:21
@jackylee-ch jackylee-ch marked this pull request as ready for review June 1, 2026 14:20
@jackylee-ch
Contributor Author

@yaooqinn PTAL

@yaooqinn
Member

yaooqinn commented Jun 2, 2026

Thanks @jackylee-ch, V3 layout is a sensible extension of the cache-stats wire we landed in #12092 / #12196. Several things to discuss before this lands:

1. Benchmark needs to be re-run. The checked-in -results.txt is 10K rows / 4 partitions / 1 iteration on an Apple M5 Pro — Stdev=0 across the board because there's only one sample. Differences in the 1-3 ms range (e.g. "1.1X" at all-16-cols read, where lazy mode physically cannot be faster than eager) are noise. Also build 1.9X is surprising because V3 does N serializeSingleColumn calls vs V2's single-pass batchSerialize — the ordering legacy > V2 > V3 doesn't match the physical work done; this needs reruns on a server / GHA-equivalent runner with iter≥3 and 100M rows / 32 partitions (matching the code defaults). Please also add a cache memory footprint column — V3 per-col framing + getFlattenedRowVector() flattening Dictionary/Constant encodings could regress cache size significantly for dict-encoded payloads, and that's currently unmeasured.

2. Do we really need a new SQLConf? V3 functionally supersedes V2 (V3 frames also carry statsBlob), so this isn't a new behavioral feature — it's a wire-format upgrade. Adding a dedicated lazy.deserialization.enabled boolean commits Gluten to maintaining three cache paths (legacy / V2-stats / V3-lazy-and-stats) and a three-level fallback chain. Once we trust V3, we'd want to deprecate V2-stats, which means another deprecation cycle. Could we either (a) skip the conf and gate V3 behind partitionStats.enabled once it's stable, or (b) turn partitionStats.enabled into a string conf with off | v2 | v3 values? Configuration.md already warns "V3 is NOT backward compatible with V2 readers" + default=false — operationally nobody is going to flip this, so the conf risks being long-lived dead code.

3. Cross-language test parity vs #12196. V3 has no cpp-side byte-equal golden test; JVM-side tests synthesize their own frames via craftV3Framed. We just established the cpp-golden ↔ JVM-parser round-trip pattern in #12196 specifically because layout drift between halves is a correctness hazard. V3 needs the same: a framedSerializeWithStatsV3Golden cpp test pinning a byte-stable literal + a JVM parser round-trip over that same literal.

4. Smaller items.

  • All-null column case not covered. We hit the PrestoSerde uninit-values bug during #12092 development ([VL] Add min/max partition stats to columnar InMemoryRelation cache for partition pruning), and the per-col path is in the same risk class.
  • getFlattenedRowVector() side effect on Dictionary/Constant encoding not documented.
  • The // JNI pin outlives comment in deserializeV3 describes a non-issue (copies are made synchronously in step 6, the lazy loader doesn't depend on the pin) — please trim.
  • Two near-identical magic checks (parseFramedBytes byte[3] dispatch vs isV3Format 4-byte compare) — please consolidate.
  • Consider folding statsExtV3AvailableFlag and statsExtAvailableFlag into a single capability enum (Unknown | V2 | V3 | Unavailable) — two independent one-shot latches double the operational diagnosis surface.

Happy to file any of these as separate issues if it helps.
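
For illustration, the single-capability-enum suggestion from the last bullet could look roughly like the sketch below. It is not code from the PR: `StatsCapability`, `CapabilityLatch`, and `resolve` are hypothetical names, and the probe callback stands in for whatever native capability check the serializer performs.

```python
import threading
from enum import Enum


class StatsCapability(Enum):
    UNKNOWN = "unknown"
    V2 = "v2"
    V3 = "v3"
    UNAVAILABLE = "unavailable"


class CapabilityLatch:
    """One-shot latch: the first resolved probe result wins and is kept."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = StatsCapability.UNKNOWN

    def resolve(self, probe):
        """Run probe() only while the state is still UNKNOWN."""
        with self._lock:
            if self._state is StatsCapability.UNKNOWN:
                self._state = probe()
            return self._state


latch = CapabilityLatch()
first = latch.resolve(lambda: StatsCapability.V2)
second = latch.resolve(lambda: StatsCapability.V3)  # ignored, already latched
```

With one state variable there is a single value to log or inspect when diagnosing which cache path a JVM ended up on, instead of two flags whose combinations have to be interpreted.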

@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 8b09d6b to 09679ee Compare June 2, 2026 06:24
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 09679ee to ab9e0f7 Compare June 2, 2026 06:30
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from ab9e0f7 to 144e816 Compare June 2, 2026 06:47
@github-actions github-actions Bot removed the CORE works for Gluten Core label Jun 2, 2026
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch 2 times, most recently from b77f4ab to 9a0f96a Compare June 2, 2026 07:28
@github-actions github-actions Bot removed the DOCS label Jun 2, 2026
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 9a0f96a to b5b1906 Compare June 2, 2026 09:01
@github-actions github-actions Bot added the INFRA label Jun 2, 2026
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch 3 times, most recently from 2b96545 to c3cc1bd Compare June 2, 2026 15:28
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 97a6019 to 9971c91 Compare June 3, 2026 03:52
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 9971c91 to f576df8 Compare June 3, 2026 06:33
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from f576df8 to f17dc6a Compare June 3, 2026 06:51
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from f17dc6a to cda20eb Compare June 3, 2026 09:27
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch 3 times, most recently from decdd0e to ab055c5 Compare June 3, 2026 14:16
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from ab055c5 to 2538fe5 Compare June 3, 2026 18:55
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 2538fe5 to 765794f Compare June 4, 2026 04:35
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 765794f to c7f9e2f Compare June 4, 2026 07:54
@github-actions github-actions Bot removed the INFRA label Jun 4, 2026
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from c7f9e2f to 42c1b15 Compare June 5, 2026 02:21
Copilot AI review requested due to automatic review settings June 5, 2026 02:21
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the Velox-backed Spark table cache to default to a new V3 framed wire format that stores per-column serialized payloads, enabling lazy materialization (projected native deserialization) while keeping partition stats as an optional pruning payload.

Changes:

  • Add V3 per-column framed cache serialization (with and without stats) plus native projected deserialization path.
  • Update cache filtering/pruning behavior and tests to reflect new lazy/V3 default and revised float NaN stats semantics.
  • Add cross-language golden framing tests, end-to-end lazy serde tests, and a benchmark to compare V2 vs V3 (with/without stats).

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| gluten-ut/spark41/src/test/scala/org/apache/spark/sql/GlutenCachedTableSuite.scala | Adjust expected cached size stats for Spark 4.1 tests. |
| gluten-ut/spark40/src/test/scala/org/apache/spark/sql/GlutenCachedTableSuite.scala | Adjust expected cached size stats for Spark 4.0 tests. |
| gluten-ut/spark35/src/test/scala/org/apache/spark/sql/GlutenCachedTableSuite.scala | Adjust expected cached size stats for Spark 3.5 tests. |
| gluten-ut/spark34/src/test/scala/org/apache/spark/sql/GlutenCachedTableSuite.scala | Adjust expected cached size stats for Spark 3.4 tests. |
| gluten-ut/spark33/src/test/scala/org/apache/spark/sql/GlutenCachedTableSuite.scala | Adjust expected cached size stats for Spark 3.3 tests. |
| gluten-substrait/src/main/scala/org/apache/gluten/config/GlutenConfig.scala | Update config description to reflect V3 lazy-by-default behavior and stats as optional payload. |
| gluten-arrow/src/main/java/org/apache/gluten/vectorized/ColumnarBatchSerializerJniWrapper.java | Add JNI APIs for V3 serialize (no-stats / with-stats) and projected deserialization. |
| docs/Configuration.md | Update rendered config table/description for table cache partition stats behavior with V3 default. |
| cpp/velox/tests/VeloxColumnarBatchSerializerTest.cc | Extend native tests: NaN stats semantics, V3 golden frames, V3 writer layout + projected/lazy read fixtures. |
| cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.h | Define V3 framed serialization/deserialization interfaces and projection-aware V3 deserializer. |
| cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.cc | Implement V3 per-column framing, lazy column loaders, projected V3 deserialization, and refine stats logic. |
| cpp/core/operators/serializer/ColumnarBatchSerializer.h | Add V3 virtual hooks and default behavior (unsupported backends return empty/throw). |
| cpp/core/jni/JniWrapper.cc | Add JNI entry points for V3 serialization and projection-aware V3 deserialization; refactor byte[] creation helper. |
| backends-velox/src/test/scala/org/apache/spark/sql/execution/ColumnarCachedBatchSerializerHelperSuite.scala | Add unit coverage for V3 capability latch and fallback behaviors. |
| backends-velox/src/test/scala/org/apache/spark/sql/execution/ColumnarCachedBatchLazySerdeTest.scala | New E2E tests validating V3 lazy cache bytes, projected reads, pruning, and compatibility behaviors. |
| backends-velox/src/test/scala/org/apache/spark/sql/execution/ColumnarCachedBatchFramedBytesSuite.scala | Add JVM-side V3 frame parsing tests + cross-language V3 golden fixtures. |
| backends-velox/src/test/scala/org/apache/spark/sql/execution/ColumnarCachedBatchE2ESuite.scala | Update/add E2E tests for NaN behavior and V3 no-stats default path correctness. |
| backends-velox/src/test/scala/org/apache/spark/sql/execution/ColumnarCachedBatchBuildFilterPruneSuite.scala | Add regression tests for null-bound referenced-column pruning bypass. |
| backends-velox/src/test/scala/org/apache/spark/sql/execution/benchmark/ColumnarTableCacheLazyDeserBenchmark.scala | New benchmark comparing V2 vs V3 (with/without stats), including cache footprint + read phases. |
| backends-velox/src/main/scala/org/apache/spark/sql/execution/ColumnarCachedBatchSerializer.scala | Add V3 magic parsing/routing, V3 serialization path, and V3 projected native deserialization integration. |

Write V3 per-column cache bytes by default for Velox table cache. Partition stats now only controls the optional stats/pruning payload: stats off writes a no-stats V3 frame, stats on writes V3 with stats, and older native libraries still fall back to V2 stats or legacy bytes.

Add the V3 no-stats JNI/native serializer, JVM parsing for statsLen=0, cross-language golden coverage, and GitHub Actions benchmark execution without committing local benchmark results.
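
The write-path selection described in this commit message can be summarized as a small decision function. This is a sketch of the stated behavior, not the serializer's code: `choose_write_format` and its boolean parameters are hypothetical, and the case of stats disabled on a native library without V3 is assumed to fall back to legacy raw bytes.

```python
def choose_write_format(stats_enabled, native_supports_v3, native_supports_v2_stats):
    """Pick the cache wire format for a write, newest capable format first."""
    if native_supports_v3:
        # Default path: V3 always, stats only decide the optional payload.
        return "v3-with-stats" if stats_enabled else "v3-no-stats"
    if stats_enabled and native_supports_v2_stats:
        # Older native library that still understands V2 framed stats.
        return "v2-stats"
    # Assumed final fallback: legacy raw bytes, eager decode, no pruning.
    return "legacy-raw"


default_path = choose_write_format(False, True, True)
pruning_path = choose_write_format(True, True, True)
old_native = choose_write_format(True, False, True)
oldest_native = choose_write_format(True, False, False)
```
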

Change-Id: I2a8582f901fafd436cac1a1d16e0367e9330b336
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 42c1b15 to 1be78bc Compare June 5, 2026 02:38
@jackylee-ch
Contributor Author

cc @yaooqinn PTAL

Labels

CORE works for Gluten Core DOCS VELOX
